ANALYSIS OF LOAN DATA FROM PROSPER by LUCAS O. SOUZA

The dataset refer to loan data by Prosper between the years of X and Y. The dataset is huge - 113,937 observations with 81 variables.

My purpose on this analysis will be to try to predict ProsperScore or ProsperRating based on different variables in the dataset. For this purpose, I will select only loans originated after July 2009, as the variables dictionary indicates that Prosper Rating and Scores were assigned only for loans originated after that date (before that another variable was used, CreditGrade). I will also work only with loans Completed or Defaulted, so I can test for the performance of Score and Rating variables in evaluating the probability of a loan being defaulted.

Finally, I will reduce the number of variables to become easier to process and analyse the data. I have selected around 15 variables from the dataset, by reading the description and considering the relevance to loan analysis, and also trying to avoid covariance between the variables (crude analysis at this point, just using the little domain knowledge I have). Furthermore, along the analysis we will look at each variable individually, then in pairs and finally through a multivariate analysis and select which ones will be most relevant in building a predictive model.

## 'data.frame':    113937 obs. of  81 variables:
##  $ ListingKey                         : Factor w/ 113066 levels "00003546482094282EF90E5",..: 7180 7193 6647 6669 6686 6689 6699 6706 6687 6687 ...
##  $ ListingNumber                      : int  193129 1209647 81716 658116 909464 1074836 750899 768193 1023355 1023355 ...
##  $ ListingCreationDate                : Factor w/ 113064 levels "2005-11-09 20:44:28.847000000",..: 14184 111894 6429 64760 85967 100310 72556 74019 97834 97834 ...
##  $ CreditGrade                        : Factor w/ 9 levels "","A","AA","B",..: 5 1 8 1 1 1 1 1 1 1 ...
##  $ Term                               : int  36 36 36 36 36 60 36 36 36 36 ...
##  $ LoanStatus                         : Factor w/ 12 levels "Cancelled","Chargedoff",..: 3 4 3 4 4 4 4 4 4 4 ...
##  $ ClosedDate                         : Factor w/ 2803 levels "","2005-11-25 00:00:00",..: 1138 1 1263 1 1 1 1 1 1 1 ...
##  $ BorrowerAPR                        : num  0.165 0.12 0.283 0.125 0.246 ...
##  $ BorrowerRate                       : num  0.158 0.092 0.275 0.0974 0.2085 ...
##  $ LenderYield                        : num  0.138 0.082 0.24 0.0874 0.1985 ...
##  $ EstimatedEffectiveYield            : num  NA 0.0796 NA 0.0849 0.1832 ...
##  $ EstimatedLoss                      : num  NA 0.0249 NA 0.0249 0.0925 ...
##  $ EstimatedReturn                    : num  NA 0.0547 NA 0.06 0.0907 ...
##  $ ProsperRating..numeric.            : int  NA 6 NA 6 3 5 2 4 7 7 ...
##  $ ProsperRating..Alpha.              : Factor w/ 8 levels "","A","AA","B",..: 1 2 1 2 6 4 7 5 3 3 ...
##  $ ProsperScore                       : num  NA 7 NA 9 4 10 2 4 9 11 ...
##  $ ListingCategory..numeric.          : int  0 2 0 16 2 1 1 2 7 7 ...
##  $ BorrowerState                      : Factor w/ 52 levels "","AK","AL","AR",..: 7 7 12 12 25 34 18 6 16 16 ...
##  $ Occupation                         : Factor w/ 68 levels "","Accountant/CPA",..: 37 43 37 52 21 43 50 29 24 24 ...
##  $ EmploymentStatus                   : Factor w/ 9 levels "","Employed",..: 9 2 4 2 2 2 2 2 2 2 ...
##  $ EmploymentStatusDuration           : int  2 44 NA 113 44 82 172 103 269 269 ...
##  $ IsBorrowerHomeowner                : Factor w/ 2 levels "False","True": 2 1 1 2 2 2 1 1 2 2 ...
##  $ CurrentlyInGroup                   : Factor w/ 2 levels "False","True": 2 1 2 1 1 1 1 1 1 1 ...
##  $ GroupKey                           : Factor w/ 707 levels "","00343376901312423168731",..: 1 1 335 1 1 1 1 1 1 1 ...
##  $ DateCreditPulled                   : Factor w/ 112992 levels "2005-11-09 00:30:04.487000000",..: 14347 111883 6446 64724 85857 100382 72500 73937 97888 97888 ...
##  $ CreditScoreRangeLower              : int  640 680 480 800 680 740 680 700 820 820 ...
##  $ CreditScoreRangeUpper              : int  659 699 499 819 699 759 699 719 839 839 ...
##  $ FirstRecordedCreditLine            : Factor w/ 11586 levels "","1947-08-24 00:00:00",..: 8639 6617 8927 2247 9498 497 8265 7685 5543 5543 ...
##  $ CurrentCreditLines                 : int  5 14 NA 5 19 21 10 6 17 17 ...
##  $ OpenCreditLines                    : int  4 14 NA 5 19 17 7 6 16 16 ...
##  $ TotalCreditLinespast7years         : int  12 29 3 29 49 49 20 10 32 32 ...
##  $ OpenRevolvingAccounts              : int  1 13 0 7 6 13 6 5 12 12 ...
##  $ OpenRevolvingMonthlyPayment        : num  24 389 0 115 220 1410 214 101 219 219 ...
##  $ InquiriesLast6Months               : int  3 3 0 0 1 0 0 3 1 1 ...
##  $ TotalInquiries                     : num  3 5 1 1 9 2 0 16 6 6 ...
##  $ CurrentDelinquencies               : int  2 0 1 4 0 0 0 0 0 0 ...
##  $ AmountDelinquent                   : num  472 0 NA 10056 0 ...
##  $ DelinquenciesLast7Years            : int  4 0 0 14 0 0 0 0 0 0 ...
##  $ PublicRecordsLast10Years           : int  0 1 0 0 0 0 0 1 0 0 ...
##  $ PublicRecordsLast12Months          : int  0 0 NA 0 0 0 0 0 0 0 ...
##  $ RevolvingCreditBalance             : num  0 3989 NA 1444 6193 ...
##  $ BankcardUtilization                : num  0 0.21 NA 0.04 0.81 0.39 0.72 0.13 0.11 0.11 ...
##  $ AvailableBankcardCredit            : num  1500 10266 NA 30754 695 ...
##  $ TotalTrades                        : num  11 29 NA 26 39 47 16 10 29 29 ...
##  $ TradesNeverDelinquent..percentage. : num  0.81 1 NA 0.76 0.95 1 0.68 0.8 1 1 ...
##  $ TradesOpenedLast6Months            : num  0 2 NA 0 2 0 0 0 1 1 ...
##  $ DebtToIncomeRatio                  : num  0.17 0.18 0.06 0.15 0.26 0.36 0.27 0.24 0.25 0.25 ...
##  $ IncomeRange                        : Factor w/ 8 levels "$0","$1-24,999",..: 4 5 7 4 3 3 4 4 4 4 ...
##  $ IncomeVerifiable                   : Factor w/ 2 levels "False","True": 2 2 2 2 2 2 2 2 2 2 ...
##  $ StatedMonthlyIncome                : num  3083 6125 2083 2875 9583 ...
##  $ LoanKey                            : Factor w/ 113066 levels "00003683605746079487FF7",..: 100337 69837 46303 70776 71387 86505 91250 5425 908 908 ...
##  $ TotalProsperLoans                  : int  NA NA NA NA 1 NA NA NA NA NA ...
##  $ TotalProsperPaymentsBilled         : int  NA NA NA NA 11 NA NA NA NA NA ...
##  $ OnTimeProsperPayments              : int  NA NA NA NA 11 NA NA NA NA NA ...
##  $ ProsperPaymentsLessThanOneMonthLate: int  NA NA NA NA 0 NA NA NA NA NA ...
##  $ ProsperPaymentsOneMonthPlusLate    : int  NA NA NA NA 0 NA NA NA NA NA ...
##  $ ProsperPrincipalBorrowed           : num  NA NA NA NA 11000 NA NA NA NA NA ...
##  $ ProsperPrincipalOutstanding        : num  NA NA NA NA 9948 ...
##  $ ScorexChangeAtTimeOfListing        : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ LoanCurrentDaysDelinquent          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ LoanFirstDefaultedCycleNumber      : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ LoanMonthsSinceOrigination         : int  78 0 86 16 6 3 11 10 3 3 ...
##  $ LoanNumber                         : int  19141 134815 6466 77296 102670 123257 88353 90051 121268 121268 ...
##  $ LoanOriginalAmount                 : int  9425 10000 3001 10000 15000 15000 3000 10000 10000 10000 ...
##  $ LoanOriginationDate                : Factor w/ 1873 levels "2005-11-15 00:00:00",..: 426 1866 260 1535 1757 1821 1649 1666 1813 1813 ...
##  $ LoanOriginationQuarter             : Factor w/ 33 levels "Q1 2006","Q1 2007",..: 18 8 2 32 24 33 16 16 33 33 ...
##  $ MemberKey                          : Factor w/ 90831 levels "00003397697413387CAF966",..: 11071 10302 33781 54939 19465 48037 60448 40951 26129 26129 ...
##  $ MonthlyLoanPayment                 : num  330 319 123 321 564 ...
##  $ LP_CustomerPayments                : num  11396 0 4187 5143 2820 ...
##  $ LP_CustomerPrincipalPayments       : num  9425 0 3001 4091 1563 ...
##  $ LP_InterestandFees                 : num  1971 0 1186 1052 1257 ...
##  $ LP_ServiceFees                     : num  -133.2 0 -24.2 -108 -60.3 ...
##  $ LP_CollectionFees                  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ LP_GrossPrincipalLoss              : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ LP_NetPrincipalLoss                : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ LP_NonPrincipalRecoverypayments    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ PercentFunded                      : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ Recommendations                    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ InvestmentFromFriendsCount         : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ InvestmentFromFriendsAmount        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Investors                          : int  258 1 41 158 20 1 1 1 1 1 ...
## [1] 2007-09-12 00:00:00 2014-03-03 00:00:00 2007-01-17 00:00:00
## [4] 2012-11-01 00:00:00 2013-09-20 00:00:00 2013-12-24 00:00:00
## 1873 Levels: 2005-11-15 00:00:00 2005-11-18 00:00:00 ... 2014-03-12 00:00:00
##         Min.      1st Qu.       Median         Mean      3rd Qu. 
## "2005-11-15" "2008-10-02" "2012-06-26" "2011-07-21" "2013-09-18" 
##         Max. 
## "2014-03-12"
## df$LoanOriginationDate 
##       n missing  unique 
##  113937       0    1873 
## 
## lowest : 2005-11-15 2005-11-18 2005-11-22 2005-11-23 2005-11-25
## highest: 2014-03-06 2014-03-07 2014-03-10 2014-03-11 2014-03-12
## df$ProsperRating..Alpha. 
##       n missing  unique 
##   20775       0       8 
## 
##                  A   AA    B    C    D    E   HR
## Frequency 121 3280 1717 2873 3105 4472 2698 2509
## %           1   16    8   14   15   22   13   12
## df$ProsperScore 
##       n missing  unique    Info    Mean     .05     .10     .25     .50 
##   20654       0      11    0.98   6.491       2       3       5       7 
##     .75     .90     .95 
##       8       9      10 
## 
##             1   2    3    4    5    6    7    8    9   10 11
## Frequency 377 918 1152 1947 2377 3020 2490 3848 2902 1591 32
## %           2   4    6    9   12   15   12   19   14    8  0

With the trimmed dataset, I have 15 variables and 20,622 observations. That is a reasonable size dataset to be analysed. I will proceed with the univariate plots section and take a closer look at each relevant variable individually.

Univariate Plots

##                    LoanKey BorrowerRate LoanOriginalAmount Term LoanStatus
## 16 3EE2364952142596779635D       0.3177               4000   36  Defaulted
## 24 D0623679715048926AB9F4D       0.2419               2000   36  Defaulted
## 27 3B763675825568665C5122A       0.2809               4000   60  Completed
## 28 7E583591759296638A02214       0.0920               4000   36  Completed
## 31 FBCE36430983505912FD996       0.0999              10000   36  Completed
## 34 91E0360126027167265496D       0.0785              16000   36  Completed
##    ProsperRating..Alpha. ProsperScore BorrowerState CurrentDelinquencies
## 16                    HR            5            FL                    0
## 24                     D            5            IL                    0
## 27                     E            3            NY                    0
## 28                     A            9            CO                    0
## 31                     A            9            CA                    1
## 34                    AA           10            IL                    0
##    DebtToIncomeRatio EmploymentStatus IncomeVerifiable
## 16              0.49            Other             True
## 24              0.39         Employed             True
## 27              0.11         Employed             True
## 28              0.26        Full-time             True
## 31              0.11         Employed             True
## 34              0.05         Employed             True
##    InvestmentFromFriendsAmount Recommendations StatedMonthlyIncome
## 16                           0               0            5500.000
## 24                           0               0            2500.000
## 27                           0               0            3885.917
## 28                           0               0            6666.667
## 31                           0               0            3750.000
## 34                           0               0           13083.333
## 'data.frame':    20622 obs. of  15 variables:
##  $ LoanKey                    : Factor w/ 113066 levels "00003683605746079487FF7",..: 27382 92033 25952 55706 111176 64426 22540 4309 50637 84140 ...
##  $ BorrowerRate               : num  0.3177 0.2419 0.2809 0.092 0.0999 ...
##  $ LoanOriginalAmount         : int  4000 2000 4000 4000 10000 16000 3000 4500 5600 4000 ...
##  $ Term                       : int  36 36 60 36 36 36 12 36 36 36 ...
##  $ LoanStatus                 : Factor w/ 12 levels "Cancelled","Chargedoff",..: 5 5 3 3 3 3 3 3 3 3 ...
##  $ ProsperRating..Alpha.      : Factor w/ 8 levels "","A","AA","B",..: 8 6 7 2 2 3 7 6 4 7 ...
##  $ ProsperScore               : num  5 5 3 9 9 10 3 6 8 3 ...
##  $ BorrowerState              : Factor w/ 52 levels "","AK","AL","AR",..: 11 16 36 7 6 16 6 11 47 12 ...
##  $ CurrentDelinquencies       : int  0 0 0 0 1 0 0 1 0 0 ...
##  $ DebtToIncomeRatio          : num  0.49 0.39 0.11 0.26 0.11 0.05 0.28 0.1 0.18 0.4 ...
##  $ EmploymentStatus           : Factor w/ 9 levels "","Employed",..: 6 2 2 3 2 2 2 2 3 2 ...
##  $ IncomeVerifiable           : Factor w/ 2 levels "False","True": 2 2 2 2 2 2 2 2 2 2 ...
##  $ InvestmentFromFriendsAmount: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Recommendations            : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ StatedMonthlyIncome        : num  5500 2500 3886 6667 3750 ...

First we will look at the distribution for each variable. The continuous variables are best analysed with histograms, while for discrete variables with a few categories I can use geom_bars. I will start with scores and ratings.

There doesn’t seem to be a strong correlation between scores and rating. Scores are right-skewed, peeking at about 8, while ratings peak at rate D. Since I am not sure about the relationship between these variables , I will test them separately when evaluating how well they performed in evaluating loan customers. Let’s see the distribution per status:

## df$LoanStatus 
##       n missing  unique 
##   20622       0       2 
## 
## Completed (19619, 95%), Defaulted (1003, 5%)

Having filtered only Completed and Defaulted Status, we can see that 19,619 of the remaining observations, equivalent to 95% of the dataset, is marked as Completed, while 1,003 or 5% are Defaulted. That skewed distribution is expected since being Defaulted is essentially an anomaly in the process.

Next, I will take a look at some of the general characteristics of the loan, such as value and interest rate

Most of the loans have a total value less than $10,000, peaking at around $5,000. The interest rate has a more flat distribution between 0.075 and 0.275, peaking at around 0.30. It will be more interesting to analyse these variables together. My guess is that lower the $5,000 peak is equivalent to the 0.30 peak on interest rate, since loans with lower amount tend to have a higher interest rate.

Terms are only for 12 months, 36 months or 60 months, with great majority of loans with a 36 months term. There is probably little information we can extract from this variable

Finally, let’s look at the other variables that the bank (and us) can use to predict whether the loan will be paid or defaulted. There are a total of 9 variables we will be analysing. Let’s plot them to see the distribution. We will start with StatedMonthlyIncome

## df$StatedMonthlyIncome 
##       n missing  unique    Info    Mean     .05     .10     .25     .50 
##   20622       0    3200       1    5724    1667    2250    3333    4833 
##     .75     .90     .95 
##    7000    9667   11945 
## 
## lowest : 0.000e+00 8.333e-02 1.417e+00 1.917e+00 2.417e+00
## highest: 1.083e+05 1.088e+05 4.167e+05 4.833e+05 6.185e+05

Stated monthly income has a few outliers. While the top quantile of 95% is $11,945, the maximum value of income is $ 618,500. Due to that I’ve ommitted the first 1% on the histogram to be better able to analyse it. It is a left skewed distribution, peaking at around $5,000 (the average loan was also $5,000).

Another way of looking at the StatedMonthlyIncome data without the interference of outliers is by plotting the log distribution. The log distribution shows a normal curve, with a peak between $1,000 and $10,000.

Now let’s see if these incomes are verifiable and what is the employment status of the loan takers:

Here we look at two variables with categorical distribution, IncomeVerifiable and Employment Status. We can see that most the clients that request loans are employed (Employed, Full-Time or Self-Employed). We also can see that the biggest part of loans are made by a client that has verifiable income.

We can also see how the debt relates to the income:

## df$DebtToIncomeRatio 
##       n missing  unique    Info    Mean     .05     .10     .25     .50 
##   18544    2078     170       1   0.238    0.06    0.08    0.13    0.20 
##     .75     .90     .95 
##    0.29    0.40    0.48 
## 
## lowest :  0.00  0.01  0.02  0.03  0.04
## highest:  5.64  6.51  7.06  7.79 10.01

Debt to income is a ratio, and given the nature of most of the loans (36 months), any rate above 1 is certainly an outlier (if a loan is equal to more than 1 year income, it would be very difficult to pay the loan in 3 years). The describe confirms that - while there are outliers, more than 95% of the data have a ratio below 0.5.

At the visualization we can see the log10 transformed data fits a normal curve, with its peak at around 0.25. That means the amount of the debt is equivalent to a 3 months salary, and 95% of the loans are less than 6 months salary. Debt to Income Ratio is a approximate ratio of two variables included in the dataset, the loan amount and income.

Next we look at Recommendations and CurrentDelinquencies:

## df$Recommendations 
##       n missing  unique    Info    Mean 
##   20622       0       8    0.09 0.04083 
## 
##               0   1  2  3 4 5 9 18
## Frequency 19949 563 91 13 2 1 1  2
## %            97   3  0  0 0 0 0  0
## df$CurrentDelinquencies 
##       n missing  unique    Info    Mean     .05     .10     .25     .50 
##   20622       0      20     0.4  0.3177       0       0       0       0 
##     .75     .90     .95 
##       0       1       2 
## 
## lowest :  0  1  2  3  4, highest: 15 16 17 21 22

These are continuous variable, but with few ocurrences, so we could decide to use them as discrete variables (use a frequency range, such as More than 10 delinquencies), or as a continuous variable.

The describe shows us there are outliers. For Recommendations, close to a 100% are between 0 and 5, with 1 loan with 9 recommendations and 2 with 18 recommendations. For CurrentDelinquencies, 95% is below 2, while the max number of delinquencies is 22.

To better visualize them I’ve trimmed the top 1% percent. In the results we can see the majority (97%) of loans have no recommendations at all. The same way the majority of loan takers have no current delinquencies at the time of the contract.

From the analysis we can infer that Recommendations is a variable with very little information, that might no be useful in the regression (since 97% have no recommendations and nevertheless 95% were not defaulted), but CurrentDelinquencies may be an indicator of a future default.

Since most of the Recommendations is 0, I will transform it to a yes or no variable instead of a continuous.

## df$Recommended 
##       n missing  unique 
##   20622       0       2 
## 
## FALSE (19949, 97%), TRUE (673, 3%)

InvestmentFromFriendsAmount might be a variable that brings more information than Recommendation, since it is quantifiable. Let’s turn to that:

## df$InvestmentFromFriendsAmount 
##       n missing  unique    Info    Mean     .05     .10     .25     .50 
##   20622       0     212    0.06   13.58       0       0       0       0 
##     .75     .90     .95 
##       0       0       0 
## 
## lowest :     0.00    25.00    25.08    25.12    25.83
## highest:  7000.00  7650.00 10000.00 10593.67 11000.00

Here we use the same strategy for StatedMonthlyIncome, plotting the data with no transformation and plotting the data in a log 10 scale. The describe shows us that more than 95% of the loan have no investment from friends. In the few cases there are, the distribution os more or less flat, with a peak at rounded values like $100 and a $1,000.

Like Recommendations, there is probably little information we can extract from this variable as well. 95% of the loans do not default, and more than 95% of the loans have no investment from friends, meaning the variable has little significance in a lon being defaulted or not.

Finally, let’s look at the loan distribution between states:

The univariate analysis of borrower states don’t tell us mych. Except there is an unusual concentration of loans in the state of California. An hypothesis for that is California being early adopters of new technologies (Prosper is a peer-to-peer lending marketplace, not a regular bank).

Univariate Analysis

What is the structure of your dataset?

The original dataset has over 100,000 observations and 81 variables. The subsetted dataset, with only loans that have been Completed or Defaulted, and originated after July 2009, have 20,662 observations. Out of 81 variables, 15 were selected for analysis, which includes continuous and discrete variables.

What is/are the main feature(s) of interest in your dataset?

Upon the univariate analysis, we can say the most interesting variable are the informations on the loan (amount and rate) and informations on loan taker income (monthly income, income verifiable, debt to income ratio and employment status), and other informations such as current delinquencies and borrower state.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

Current delinquencies and borrower state are relevant information that may or may not be correlated with a debt being defaulted or not.

Did you create any new variables from existing variables in the dataset?

Since more than 97% of the loans had 0 recommendations, I have transformed Recommendations variable, which is originally numerical, into a binary Yes/No variable called Recommended.

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

Yes. The variables related to dollar amount suche as StatedMonthlyIncome, DebtToIncomeRatio (ratio of two dollar amount variables) and LoanOriginalAmount are best represented in a log10 scale, fitting a normal curve.

Bivariate Plots Section

I will start the bivariate analysis by plotting score against rate, to see how they are related

Scores and Rating seem to be highly correlated. Given this fact, I will use only score for further analysis, as using both will only add unnecessary complexity.

Let’s see how the score relates to a loan being defaulted or not

Scores are certainly related to a loan being defaulted or not, which means they have some accuracy in their predictions But the difference is not considerable, just 1 point in the average. The observations of loans defaulted have a IQR lower and upper bounder of 4 and 7, while for completed are between 5 and 8.

Let’s analyse how the other variables relates to score and see which ones we can use to predict the score. We will start with the loan main characteristics, its rate, amount, and term. I am using with very low alpha values, since the huge number of observations is causing overplotting in the scatterplots.:

## 
##  Pearson's product-moment correlation
## 
## data:  ProsperScore and LoanOriginalAmount
## t = 17.905, df = 20620, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.1102676 0.1371470
## sample estimates:
##     cor 
## 0.12373

Higher amount loans seems to have a higher score up untill score 8, but there is an odd behavior at score 9 and 10. A correlation score of 0.12 indicates they are slightly correlated, so we should pay attention to this variable.

## 
##  Pearson's product-moment correlation
## 
## data:  ProsperScore and LoanOriginalAmount
## t = 17.905, df = 20620, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.1102676 0.1371470
## sample estimates:
##     cor 
## 0.12373

As expected, there is a clear correlation between the Rate and Score. Rate is nothing more than the reward of the risk taken by the bank, hence rate is directly correlated with the loan’s evaluated score ( in this case we know the correlation is a causation, score causes effect on rate).

## df$Term 
##       n missing  unique 
##   20622       0       3 
## 
## 12 (1459, 7%), 36 (16563, 80%), 60 (2600, 13%)

There is no difference between 12 and 36, but there is a difference between 60 months loans and 12/36 months loans - long term loans have an average score 2 points lower than short term loans. We can convert these to a new variable that states whether it is a long term loan or not, and better represent the difference, as shown below:

## df$LongTermLoan 
##       n missing  unique 
##   20622       0       2 
## 
## FALSE (18022, 87%), TRUE (2600, 13%)

Moving on to current delinquencies:

## 
##  Pearson's product-moment correlation
## 
## data:  ProsperScore and CurrentDelinquencies
## t = -20.269, df = 20620, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.1531224 -0.1263584
## sample estimates:
##        cor 
## -0.1397659

There seems to be a relationship although not very clear, between score and current delinquencies. Scores 9 and 10 have less current delinquencies than lower scores. The correlation test shows us there is a negative -.13 correlation, meaning the scores increase as the current delinquencies decrease, which is expected.

Let’s see how the score changes according to DebtIncomeRatio (in the univariate analysis we have seen debt to income ratio above 1 are outliers, hence I’m limiting the y axis to be between 0 and 1):

## 
##  Pearson's product-moment correlation
## 
## data:  ProsperScore and DebtToIncomeRatio
## t = -23.65, df = 18542, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.1850584 -0.1571151
## sample estimates:
##        cor 
## -0.1711212

Here again we see a negative correlation of -.17, which is somewhat significant. The score increases as the debt to income ratio decreases, which is clear also in the scatterplot. The smoothed line shows the correlation.

Next we will look at the relationship between score and EmploymentStatus using a boxplot, since Employment Status is a categorical value.

Here we can see that loan takers which defined themselves as emploted Full-time, Part-time or Retired have a biggest chance of getting a high score in their evalution, in comparison to loan takers which just defined as ‘Employed’ (not specified how many hours), ‘Self-Employed’, ‘Not-Employed’ or ‘Other’.

It is just the status that matter, or also how much you earn? Let’s look at the relationship between monthly income (removing the top 1% for outliers) and score

## 
##  Pearson's product-moment correlation
## 
## data:  ProsperScore and StatedMonthlyIncome
## t = 7.6464, df = 20620, p-value = 2.154e-14
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.03955406 0.06677404
## sample estimates:
##        cor 
## 0.05317393

At first glance there is not a very clear relationship. Adding a smoothed trend line to the graph we can confirm the trend is slightly turned upwards, but with very little impact - correlation test gives us a 0.05 score.

It makes more sense to analyse income as in relation to the loan amount, so let’s plot that:

## 
##  Pearson's product-moment correlation
## 
## data:  ProsperScore and IncomeOverLoan
## t = 4.2891, df = 20620, p-value = 1.802e-05
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.01621352 0.04348635
## sample estimates:
##        cor 
## 0.02985549

Income over Loan has a very low correlation, and only seem to be of any relevance in the scores 9 and 10. Since we already have the Debt to Income Ratio, I will not be using this variable in the predictive model. Next we will look at if knowing the income is verifiable has any effect on the score:

As we can see in the visualization, loans (takers) with verifiable income have an increase of approximate 2 in the score in relation to the loan takes without verifiable income.

What about recommendations? Does it impact the score somehow? Here we will use the categorical variable created for the analysis, Recommended, which ignores the amount of recommendations.

The visualization shows there is a very little impact, hence it is a variable we can ignore for the linear regression. Let’s analyse InvestmentFromFriendsAmount, which also reflects the support network of the loan taker:

## 
##  Pearson's product-moment correlation
## 
## data:  ProsperScore and InvestmentFromFriendsAmount
## t = -0.2401, df = 20620, p-value = 0.8103
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.01532028  0.01197679
## sample estimates:
##          cor 
## -0.001672059

There is not a clear pattern. We will convert to a binary value, similar to recommended, to see if a patter emerges:

## df$FriendsInvestment 
##       n missing  unique 
##   20622       0       2 
## 
## FALSE (20201, 98%), TRUE (421, 2%)

Now the relationship is more visible. Loans which have investment from friends have an average score 1 point higher than loans without investment from friends.

Finally, let’s look at the state:

Most of the states have loans ranging from 1 to 10. The average scores are situaated between 6 and 8, with only 3 states with an average score of 8, DC, DE and HI. The visualization shows there is some influence between state and score, at least in this dataset.

Let’s wrap up this section by looking at the correlation between all the above mentioned variables and see if we can find an interesting unexpected pattern:

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

Our feature of interest is score. Almost all the variables analysed have some impact on score. Employment status varies up to 3 points (varies the average between the different categories). States varies up to 2 points, also does LongTermLoan and IncomeVerifiable. FriendsInvestment varies up to 1 point. DebtToIncomeRatio and CurrentDelinquencies have a high correlation with Score, 0.17 and 0.11 respectively, and varies significantly across the range of scores.

There are variables that can shows very little correlation with score, and can be discarded for the regression. One of them is Recommended - 97% of the loans have no recommendations, and the bivariate analysis shows there is very little impact on the result. StatedMontlyIncome and its variation StatedMonthlyIncome/LoanOriginalAmount have shown no significance correlation with Score, and may generate unnecessary noise for the regression.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

Not any that would draw attention, most of variables seem either related to score or not related at all.

What was the strongest relationship you found?

The strongest relationship is between Score and DebtToIncomeRatio, with a correlation score of 0.17. On the category side, two binary valuables shows a strong relationship with score, LongTermLoan and IncomeVerifiable.

Multivariate Plots Section

Let’s look at our strongest variables together. First we will see how Employment Status combined with IncomeVerifiable affects score:

Analysed side by side we can see that for Prosper being employed is not enough to guarantee a high score, one would also need to verify its income. The distribution between categories also changes - Not employed and Retired have similar scores to FullTime when the income has been verified.

Let’s change the focus from score to BorrowerRate, which are highly correlated variables, and see how other variables impact the BorrowerRate. Given the same DebtToIncomeRatio, do longer term loans have higher or lower borrowing rates?

An interesting pattern emerges from this visualization. We can see that long term loans with the same DebtToIncomeRatio have a higher borrower rate. It is particularly interesting because the financial market usually works different… long term loans, such as mortgages, have lower rates, while short term loans such as working capital usually have a higher rate.

Let’s also analyse these variables against amount.

I am using a log10 transformation scale to better visualize the distribution on y axis. Here we can confirm the strange pattern we’ve seen before. Long term loans, with higher amounts, have higher borrower rates.

Let’s use the same pattern to look at the impact of IncomeVerifiable at the BorrowerRate

Here again we can confirm that given the same loan amount, a loan taker without verifiable income has a higher borrower rate, confirming the correlation between verifiable income and borrower rate/score.

My proposal is for us to try and predict the score using the variables we selected from the univariate, bivariate and multivariate analysis.

We will add the selected variables one by one, and see how they impact the R2 value. In the regression we will not use BorrowerRate, since they are defined from the score, and ProsperRating..Alpha, which is similar to scores. All other relevant variables will be used:

## 
## Calls:
## m1: lm(formula = ProsperScore ~ LoanOriginalAmount, data = df)
## m2: lm(formula = ProsperScore ~ LoanOriginalAmount + LongTermLoan, 
##     data = df)
## m3: lm(formula = ProsperScore ~ LoanOriginalAmount + LongTermLoan + 
##     DebtToIncomeRatio, data = df)
## m4: lm(formula = ProsperScore ~ LoanOriginalAmount + LongTermLoan + 
##     DebtToIncomeRatio + CurrentDelinquencies, data = df)
## m5: lm(formula = ProsperScore ~ LoanOriginalAmount + LongTermLoan + 
##     DebtToIncomeRatio + CurrentDelinquencies + EmploymentStatus, 
##     data = df)
## m6: lm(formula = ProsperScore ~ LoanOriginalAmount + LongTermLoan + 
##     DebtToIncomeRatio + CurrentDelinquencies + EmploymentStatus + 
##     IncomeVerifiable, data = df)
## m7: lm(formula = ProsperScore ~ LoanOriginalAmount + LongTermLoan + 
##     DebtToIncomeRatio + CurrentDelinquencies + EmploymentStatus + 
##     IncomeVerifiable + FriendsInvestment, data = df)
## m8: lm(formula = ProsperScore ~ LoanOriginalAmount + LongTermLoan + 
##     DebtToIncomeRatio + CurrentDelinquencies + EmploymentStatus + 
##     IncomeVerifiable + FriendsInvestment + BorrowerState, data = df)
## 
## ====================================================================================================================================
##                                                m1         m2         m3         m4         m5         m6         m7         m8      
## ------------------------------------------------------------------------------------------------------------------------------------
##   (Intercept)                                6.108***   6.083***   6.501***   6.639***   6.096***   6.096***   6.096***   5.876***  
##                                             (0.026)    (0.026)    (0.029)    (0.030)    (0.033)    (0.033)    (0.033)    (0.324)    
##   LoanOriginalAmount                         0.000***   0.000***   0.000***   0.000***   0.000***   0.000***   0.000***   0.000***  
##                                             (0.000)    (0.000)    (0.000)    (0.000)    (0.000)    (0.000)    (0.000)    (0.000)    
##   LongTermLoan                                         -0.761***  -0.813***  -0.803***  -0.522***  -0.522***  -0.522***  -0.513***  
##                                                        (0.051)    (0.051)    (0.051)    (0.050)    (0.050)    (0.050)    (0.050)    
##   DebtToIncomeRatio                                               -1.160***  -1.194***  -1.173***  -1.173***  -1.173***  -1.166***  
##                                                                   (0.049)    (0.048)    (0.047)    (0.047)    (0.047)    (0.047)    
##   CurrentDelinquencies                                                       -0.278***  -0.250***  -0.250***  -0.250***  -0.248***  
##                                                                              (0.014)    (0.014)    (0.014)    (0.014)    (0.014)    
##   EmploymentStatus: Full-time/Employed                                                   1.251***   1.251***   1.252***   1.252***  
##                                                                                         (0.035)    (0.035)    (0.036)    (0.036)    
##   EmploymentStatus: Not employed/Employed                                                3.216      3.216      3.215      3.024     
##                                                                                         (2.099)    (2.099)    (2.099)    (2.096)    
##   EmploymentStatus: Other/Employed                                                      -0.530***  -0.530***  -0.530***  -0.521***  
##                                                                                         (0.107)    (0.107)    (0.107)    (0.107)    
##   EmploymentStatus: Part-time/Employed                                                   1.483***   1.483***   1.483***   1.492***  
##                                                                                         (0.177)    (0.177)    (0.177)    (0.177)    
##   EmploymentStatus: Retired/Employed                                                     1.000***   1.000***   1.000***   0.995***  
##                                                                                         (0.156)    (0.156)    (0.156)    (0.156)    
##   EmploymentStatus: Self-employed/Employed                                              -4.383*    -4.383*    -4.384*    -4.270*    
##                                                                                         (2.099)    (2.099)    (2.099)    (2.094)    
##   FriendsInvestment                                                                                           -0.019     -0.071     
##                                                                                                               (0.114)    (0.115)    
##   BorrowerState: AL/AK                                                                                                   -0.186     
##                                                                                                                          (0.350)    
##   BorrowerState: AR/AK                                                                                                    0.164     
##                                                                                                                          (0.371)    
##   BorrowerState: AZ/AK                                                                                                   -0.116     
##                                                                                                                          (0.344)    
##   BorrowerState: CA/AK                                                                                                    0.257     
##                                                                                                                          (0.326)    
##   BorrowerState: CO/AK                                                                                                    0.076     
##                                                                                                                          (0.339)    
##   BorrowerState: CT/AK                                                                                                    0.088     
##                                                                                                                          (0.344)    
##   BorrowerState: DC/AK                                                                                                    0.594     
##                                                                                                                          (0.386)    
##   BorrowerState: DE/AK                                                                                                    0.860     
##                                                                                                                          (0.442)    
##   BorrowerState: FL/AK                                                                                                    0.301     
##                                                                                                                          (0.329)    
##   BorrowerState: GA/AK                                                                                                    0.339     
##                                                                                                                          (0.332)    
##   BorrowerState: HI/AK                                                                                                    0.630     
##                                                                                                                          (0.417)    
##   BorrowerState: ID/AK                                                                                                   -0.097     
##                                                                                                                          (0.393)    
##   BorrowerState: IL/AK                                                                                                    0.517     
##                                                                                                                          (0.330)    
##   BorrowerState: IN/AK                                                                                                    0.203     
##                                                                                                                          (0.340)    
##   BorrowerState: KS/AK                                                                                                   -0.109     
##                                                                                                                          (0.360)    
##   BorrowerState: KY/AK                                                                                                    0.274     
##                                                                                                                          (0.358)    
##   BorrowerState: LA/AK                                                                                                   -0.120     
##                                                                                                                          (0.367)    
##   BorrowerState: MA/AK                                                                                                    0.221     
##                                                                                                                          (0.339)    
##   BorrowerState: MD/AK                                                                                                    0.106     
##                                                                                                                          (0.337)    
##   BorrowerState: MI/AK                                                                                                   -0.067     
##                                                                                                                          (0.335)    
##   BorrowerState: MN/AK                                                                                                    0.402     
##                                                                                                                          (0.340)    
##   BorrowerState: MO/AK                                                                                                    0.113     
##                                                                                                                          (0.338)    
##   BorrowerState: MS/AK                                                                                                   -0.259     
##                                                                                                                          (0.378)    
##   BorrowerState: MT/AK                                                                                                   -0.166     
##                                                                                                                          (0.429)    
##   BorrowerState: NC/AK                                                                                                    0.316     
##                                                                                                                          (0.335)    
##   BorrowerState: NE/AK                                                                                                   -0.124     
##                                                                                                                          (0.380)    
##   BorrowerState: NH/AK                                                                                                    0.077     
##                                                                                                                          (0.392)    
##   BorrowerState: NJ/AK                                                                                                    0.213     
##                                                                                                                          (0.336)    
##   BorrowerState: NM/AK                                                                                                   -0.558     
##                                                                                                                          (0.408)    
##   BorrowerState: NV/AK                                                                                                   -0.034     
##                                                                                                                          (0.358)    
##   BorrowerState: NY/AK                                                                                                    0.468     
##                                                                                                                          (0.329)    
##   BorrowerState: OH/AK                                                                                                    0.146     
##                                                                                                                          (0.333)    
##   BorrowerState: OK/AK                                                                                                    0.398     
##                                                                                                                          (0.364)    
##   BorrowerState: OR/AK                                                                                                    0.157     
##                                                                                                                          (0.344)    
##   BorrowerState: PA/AK                                                                                                    0.020     
##                                                                                                                          (0.335)    
##   BorrowerState: RI/AK                                                                                                    0.329     
##                                                                                                                          (0.393)    
##   BorrowerState: SC/AK                                                                                                    0.286     
##                                                                                                                          (0.358)    
##   BorrowerState: SD/AK                                                                                                   -0.153     
##                                                                                                                          (0.460)    
##   BorrowerState: TN/AK                                                                                                    0.291     
##                                                                                                                          (0.342)    
##   BorrowerState: TX/AK                                                                                                    0.106     
##                                                                                                                          (0.329)    
##   BorrowerState: UT/AK                                                                                                   -0.162     
##                                                                                                                          (0.365)    
##   BorrowerState: VA/AK                                                                                                    0.327     
##                                                                                                                          (0.333)    
##   BorrowerState: VT/AK                                                                                                   -0.198     
##                                                                                                                          (0.516)    
##   BorrowerState: WA/AK                                                                                                    0.306     
##                                                                                                                          (0.335)    
##   BorrowerState: WI/AK                                                                                                    0.573     
##                                                                                                                          (0.343)    
##   BorrowerState: WV/AK                                                                                                    0.223     
##                                                                                                                          (0.412)    
##   BorrowerState: WY/AK                                                                                                    0.127     
##                                                                                                                          (0.560)    
## ------------------------------------------------------------------------------------------------------------------------------------
##   R-squared                                      0.02       0.03       0.06       0.08       0.14       0.14       0.14       0.15  
##   adj. R-squared                                 0.02       0.03       0.06       0.08       0.14       0.14       0.14       0.14  
##   sigma                                          2.28       2.27       2.20       2.17       2.10       2.10       2.10       2.09  
##   F                                            320.58     272.96     373.29     382.71     302.29     302.29     274.80      55.10  
##   p                                              0.00       0.00       0.00       0.00       0.00       0.00       0.00       0.00  
##   Log-likelihood                            -46296.43  -46186.05  -40909.54  -40717.68  -40052.36  -40052.36  -40052.35  -39974.65  
##   Deviance                                  107606.95  106461.17   89517.00   87683.68   81612.35   81612.35   81612.22   80931.20  
##   AIC                                        92598.86   92380.10   81829.08   81447.36   80128.73   80128.73   80130.70   80069.31  
##   BIC                                        92622.66   92411.84   81868.22   81494.32   80222.66   80222.66   80232.46   80538.98  
##   N                                          20622      20622      18544      18544      18544      18544      18544      18544     
## ====================================================================================================================================

As seen, adding dimensions increased our linear regression model from an R-squared of 0.02 to an R-squared of 0.15, which means the final model only explains about 15% of the variance in scores. That is not a good model, and we would need to use a lot more variables to achieve a reasonable value. Nevertheless we are in the right path, as adding each of the variables increased R-squared a little bit.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

Most of features are either correlated or not. Two features that are stronger combined is EmploymentStatus and IncomeVerifiable, meaning it is not enough to report employment status to have a higher score, it also needs to be confirmed

Were there any interesting or surprising interactions between features?

There was an interesting pattern I have perceived and wanted to analyse further, that is how longer term loans, with more amount, have higher borrower rates. This is unusual, since longer term loans tend to have lower borrowing rates, such as mortgages. I’m assuming longer term loans have more guarantees, as expected, but I can not confirm this is the case since I don’t have this variable in my selected dataset. The reason why this pattern emerges might means that guarantees are not an important part of the loan process in Prosper, perhaps not even required.

OPTIONAL: Did you create any models with your dataset? Discuss the strengths and limitations of your model.

Yes, the idea from the beggining was to try to predict scores with a linear regression model. The final uses 2 features from the loan, amount and term, and 6 features regarding the loan and the loan taker, DebtToIncomeRatio, CurrentDelinquencies, EmploymentStatus, IncomeVerifiable, FriendsInvestment and BorrowerState.

The final R-squared is 0.15. Although it only explains 15% of the variance in the scores dataset, the incremental linear regression shows that each variable added improved the model a little bit, and that we could improve it further by adding more variables that were not selected in this dataset.


Final Plots and Summary

Plot One

## 
##  Pearson's product-moment correlation
## 
## data:  ProsperScore and BorrowerRate
## t = -147.13, df = 20620, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.7222461 -0.7089282
## sample estimates:
##        cor 
## -0.7156522

Description One

Bivariate analysis between BorrowerRate and Score, showing how they are correlated (in this case correlation means causation, as borrower rate is defined by the loan score amongst other factors. As the score increases, the borrower rate decreases.

We are interested here in seeing if the rate varies according to the score, as it is expected, or if there are other factors driving the rate. From the visualization we can infer that the loan score has a high effect on the score, meaning it is used to determine the rate of the customer and hence reflects accurately the customer evaluation made by Prosper. As the correlation shows, the variance of the loan score can explain more than 70% of the variance in the borrower rate.

The visualization also shows the relationship is non-linear. Scores from 1 to 5 have an approximate borrower rate, close to 0.3, and there is a sudden decrease from 5 to 10, reaching below 0.1.

Plot Two

Description Two

In this visualization we analyse one variable deeper to see how it affect the scores, Income Verifiable. We can see, from the visualization, how the fact of the income being verifiable or not affects the score, given the same employment status. In almost all categories, including Non-Employed, having a verifiable income have guaranteed a rate 1 or 2 points higher than loan takers without verifiable income. This means that Prosper takes into consideration not only the declared status but also what the loan taker can prove.

Another visible aspect is that while the mean score per employment status differ if the income is not verified, they do not differ as much if the income is verified. In the second category, loan takers declared as Full Time, Not Employed, Part-Time or even Retired have been given, on average, the same loan score.

Plot Three

## 
##  Pearson's product-moment correlation
## 
## data:  ProsperScore and DebtToIncomeRatio
## t = -23.65, df = 18542, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.1850584 -0.1571151
## sample estimates:
##        cor 
## -0.1711212

The relationship between borrower rate and debt to income ratio is depicted in the visualization, splitting the observations between loans which are long term or short/medium term. The visualization shows an unexpected pattern, which has been hinted in previous visualizations: long term loans have a higher borrower rate, given the same debt to income ratio (debt to income ratio is the variable that mostly affects the score, with a -.17 correlation).

The pattern is unexpected because long term loans tend to have a lower borrower rate, since they usually have some kind of guarantee attached, such as mortgages on houses and cars, as opposed to short term loans which are cash flow oriented. We can assume that, at Prosper, long term loans (60 months) and short(12) and medium term(36) loans are treated equal, probably with none of them requiring a guarantee from the loan taker.

Description Three

Relationship between Borrower Rate, Debt To Income Ration and Long Term Loan. The visualization shows the purple dots right centered, while the yellow dots are distributed more evenly. In other words, given the same Debt To Income Ratio, the borrower rate will be higher if the loan is long term.

Reflection

Along this work I have conducted a very thorough analysis of the given dataset. Unfortunately, given the timeframe, I could not work with all variables. But a lot of interesting patterns emerged from the analysis.

With the variables selected, I tried to fit the most strong ones into a linear regression model to try to replicate Prosper’s scoring method, but while the results are correlated with the Score, they are far from accurate, with an R2 of 0.15. We would need much more variables to get closer to the actual result.

I have tried to apply all knowledge from the course, using different visualization styles as the variable types required, and wrangling the data when necessary (not much wrangling was necessary since I’m using a curated dataset from Udacity)

I am very interested to further pursuing this analysis and see how much closer to the actual score I can get, given all variables, with other machine learning methods. And even see if I can come up with a more effective scoring method, based on the available data, that performs better than ProsperScore at predicting if a loan is going to be defaulted or not.